Filtering data

In the last section we looked at how to act on entire columns at once. For example when we did:

tips["total_bill"] * 100

it applied the multiplication to every row, multiplying each number by 100.

Sometimes we don’t want to have to deal with entire columns at once, we might only want to grab a subset of the data and look in just that part. For example, with the tips data, we might think that the day of the week will affect the data so we just want to grab the data for Saturdays.

In Pandas there are two steps to asking a question like this.

create a filter which describes the question you want to ask
apply that filter to the data to get just the bits you are interested in

You create a filter by performing some operation on your DataFrame or a column within it. To ask about only those rows which refer to Saturday, you grab the day column and compare it to "Sat":

import pandas as pd
tips = pd.read_csv("./data/tips.csv")

tips["day"] == "Sat"

0      False
1      False
2      False
3      False
4      False
       ...  
239     True
240     True
241     True
242     True
243    False
Name: day, Length: 244, dtype: bool

This has created a filter object (sometimes called a mask or a boolean array) which has True set for the rows where the day is Saturday and False elsewhere.

We could save this filter as a variable:

sat_filter = tips["day"] == "Sat"

We can use this to filter the DataFrame as a whole. tips["day"] == "Sat" has returned a Series containing booleans. Passing it back into tips as an indexing operation will use it to filter based on the day column, only keeping those rows which contained True in the filter:

tips[sat_filter]

	total_bill	tip	day	time	size
19	20.65	2.34	Sat	Dinner	3
20	17.92	2.86	Sat	Dinner	2
21	20.29	1.92	Sat	Dinner	2
22	15.77	1.56	Sat	Dinner	2
23	39.42	5.31	Sat	Dinner	4
...	...	...	...	...	...
238	35.83	3.27	Sat	Dinner	3
239	29.03	4.14	Sat	Dinner	3
240	27.18	1.40	Sat	Dinner	2
241	22.67	1.40	Sat	Dinner	2
242	17.82	1.22	Sat	Dinner	2

87 rows × 5 columns

Notice that it now says that the table only has 87 rows, down from 244. However, the index has been maintained. This is because the row labels are connected to the row, they’re not just row numbers.

It is more common to do this in one step, rather than creating and naming a filter object. So the code becomes:

tips[tips["day"] == "Sat"]

	total_bill	tip	day	time	size
19	20.65	2.34	Sat	Dinner	3
20	17.92	2.86	Sat	Dinner	2
21	20.29	1.92	Sat	Dinner	2
22	15.77	1.56	Sat	Dinner	2
23	39.42	5.31	Sat	Dinner	4
...	...	...	...	...	...
238	35.83	3.27	Sat	Dinner	3
239	29.03	4.14	Sat	Dinner	3
240	27.18	1.40	Sat	Dinner	2
241	22.67	1.40	Sat	Dinner	2
242	17.82	1.22	Sat	Dinner	2

87 rows × 5 columns

This has given us back our subset of data as another DataFrame which can used in exactly the same way as the previous one (further filtering, summarising etc.).

Exercise

Select the data for only Thursdays.
Calculate the mean of the tip column for Thursdays
Compare this with the mean of the tip column for Saturdays

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

thurs = tips[tips["day"] == "Thur"]
thurs

	total_bill	tip	day	time	size
77	27.20	2.80	Thur	Lunch	4
78	22.76	2.10	Thur	Lunch	2
79	17.29	1.90	Thur	Lunch	2
80	19.44	2.10	Thur	Lunch	2
81	16.66	2.38	Thur	Lunch	2
...	...	...	...	...	...
202	13.00	1.40	Thur	Lunch	2
203	16.40	1.75	Thur	Lunch	2
204	20.53	2.80	Thur	Lunch	4
205	16.47	2.26	Thur	Lunch	3
243	18.78	2.10	Thur	Dinner	2

62 rows × 5 columns

thurs["tip"].mean()

1.9398387096774192

tips[tips["day"] == "Sat"]["tip"].mean()

2.095402298850575

Other filters

As well as filtering with the == operator (which only checks for exact matches), you can do other types of comparisons. Any of the standard Python comparisons will work (i.e. ==, !=, <, <=, >, >=).

To grab only the rows where the total bill is less than £8 we can use <:

tips[tips["total_bill"] < 8]

	total_bill	tip	day	time	size
67	3.07	0.70	Sat	Dinner	1
92	5.75	0.70	Fri	Dinner	2
111	7.25	0.70	Sat	Dinner	1
149	7.51	1.40	Thur	Lunch	2
172	7.25	3.60	Sun	Dinner	2
195	7.56	1.01	Thur	Lunch	2
218	7.74	1.01	Sat	Dinner	2

Exercise

Filter the data to only include parties of 5 or more people.

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

tips[tips["size"] >= 5]

	total_bill	tip	day	time	size
125	29.80	2.94	Thur	Lunch	6
141	34.30	4.69	Thur	Lunch	6
142	41.19	3.50	Thur	Lunch	5
143	27.05	3.50	Thur	Lunch	6
155	29.85	3.60	Sun	Dinner	5
156	48.17	3.50	Sun	Dinner	6
185	20.69	3.50	Sun	Dinner	5
187	30.46	1.40	Sun	Dinner	5
216	28.15	2.10	Sat	Dinner	5

Combining filters

If you want to apply multiple filters, for example to select only “Saturdays with small total bills” you can do it in one of two different ways. Either split the question into multiple steps, or ask it all at once.

Let’s do it multiple steps first since we already have tools we need for that:

sat_tips = tips[tips["day"] == "Sat"]  # First grab the Saturday data and save it as a variable
sat_tips[sat_tips["total_bill"] < 8]  # Then act on the new DataFrame as use it as before

	total_bill	tip	day	time	size
67	3.07	0.70	Sat	Dinner	1
111	7.25	0.70	Sat	Dinner	1
218	7.74	1.01	Sat	Dinner	2

Or, you can combine the questions together using the & operator with a syntax like:

df[(filter_1) & (filter_2)]

so in our case filter 1 is tips["day"] == "Sat" and filter 2 is tips["total_bill"] < 8 so it becomes:

tips[(tips["day"] == "Sat") & (tips["total_bill"] < 8)]

	total_bill	tip	day	time	size
67	3.07	0.70	Sat	Dinner	1
111	7.25	0.70	Sat	Dinner	1
218	7.74	1.01	Sat	Dinner	2

If you want to do an “or” operation, then instead of & you can use |.

Exercise 3

Filter the data to only include parties of 4 or more people which happened at lunch time.

Hint: The size and time columns are what you want to use here.

Answer

import pandas as pd

tips = pd.read_csv("./data/tips.csv")

tips[(tips["size"] >= 4) & (tips["time"] == "Lunch")]

	total_bill	tip	day	time	size
77	27.20	2.80	Thur	Lunch	4
85	34.83	3.62	Thur	Lunch	4
119	24.08	2.04	Thur	Lunch	4
125	29.80	2.94	Thur	Lunch	6
141	34.30	4.69	Thur	Lunch	6
142	41.19	3.50	Thur	Lunch	5
143	27.05	3.50	Thur	Lunch	6
197	43.11	3.50	Thur	Lunch	4
204	20.53	2.80	Thur	Lunch	4

DataFrame indexing

When we use the square bracket syntax on a DataFrame directly there are a few different types of object that can be passed:

A single string: This will select a single column form the DataFrame, returning a Series object.
A list of strings: This will select those columns by name, returning a DataFrame.
A filter (a Series of True/False): This will filter the table as a whole, returning a DataFrame with only the rows matching Trueincluded.

These are provided as shortcuts as they are the most common operations to do an a DataFrame. This is why some of them operate on columns and other on rows.

If you want to be explicit about which axis you are acting on, you can pass these same types of objects to the .loc[rows, columns] attribute with one argument per axis. This means that

tips[sat_filter]

is equivalent to

tips.loc[sat_filter]

and that

tips["size"]

is equivalent to

tips.loc[:, "size"]

The full set of rules for DataFrame.loc are in the documentation.